Abstract

Reputation is a valuable asset in online social lives and has drawn increasing attention. Evaluating user reputation in online rating systems is especially important because of spamming attacks. To address this issue, a variety of methods have been proposed, including network-based methods, quality-based methods and the group-based ranking method. In this paper, we propose an iterative group-based ranking (IGR) method by introducing an iterative reputation-allocation process into the original group-based ranking (GR) method. More specifically, users with higher reputation have higher weights in determining the corresponding group sizes. The reputation of users and the corresponding group sizes are iteratively updated until they become stable. Results on two real data sets suggest that the proposed IGR method performs better and is considerably more robust than the original GR method. Our work highlights the positive role of users’ grouping behavior in achieving better reputation evaluation.

1. Introduction


In the age of the Internet, individual reputation is a fundamental building block of online ecosystems, especially in the field of e-commerce. Meanwhile, a new challenge arises: how to create and maintain reputation in online communities. To better uncover objects’ true quality, many platforms, e.g., Amazon, eBay, Taobao and MovieLens, implement online rating systems, where users give feedback by assigning ratings to objects. The ratings provide a direct measure of reputation for the objects and further affect users’ decisions. Usually, high ratings result in high sales, whereas low ratings have the opposite effect. Extracting credible information from this abundant feedback is becoming a major challenge, since noisy ratings are common in practical systems. For example, some users may give unreasonable ratings due to poor judgement, and others may purposefully steer public choices by giving maximal or minimal ratings. These noisy ratings can harm the effectiveness of online rating systems and reduce the accuracy of the obtained information. Therefore, measuring users’ credibility, filtering out untrusted users and ensuring the reliability of online rating systems are becoming urgent tasks.

To cope with these concerns, online reputation systems have been introduced. These systems provide decision support for Internet-mediated services and help to maintain the healthy development of online rating systems and recommender systems. As the core of reputation systems, a variety of user reputation evaluation methods have been proposed, in which each user is assigned a reputation value based on his/her rating behavior. Typically, these previous methods can be divided into three categories:


• Network-based methods. As online rating systems can be described by bipartite networks, the reputation of users can be calculated by many existing network ranking methods such as PageRank, LeaderRank, mass diffusion and heat conduction. In these methods, a user’s reputation is measured by the amount of resources that the user receives in the resource-allocation process. Although these methods are very efficient, they suffer from rating noise and thus have limited performance. As a result, these methods are not well suited to user reputation evaluation in bipartite rating networks.

• Quality-based methods. Under the assumption that each object has a most objective rating that best reflects its quality, the quality-based methods measure a user’s reputation by the difference between the user’s rating values and the estimated quality values of the objects. These methods include the iterative refinement (IR) method, an improved IR method, the correlation-based ranking (CR) method, the reputation redistribution ranking (RR) method and seven other methods. These methods perform well in user reputation evaluation; however, some of them may not converge and some others are not robust to spamming attacks. More importantly, since the online rating system is fundamentally a socialized information collection platform, one object should admit multiple reasonable ratings, because ratings are subjective and can be affected by users’ backgrounds and other factors. Therefore, the underlying assumption of quality-based methods deserves scrutiny.

• Group-based method. Recently, a group-based ranking (GR) method was proposed, in which users are grouped based on their rating similarities and users’ reputation is calculated from the corresponding group sizes. Users are assigned high reputation if they always fall into large rating groups. This method is free from the assumption of the quality-based methods and performs better in evaluating user reputation on data sets with spamming attacks. However, the method is not robust against a large number of large-degree spammers, because it is a one-step process and all ratings contribute equally to the corresponding group sizes regardless of users’ reputation.

In this paper, we propose an iterative group-based ranking (IGR) method by introducing an iterative reputation-allocation process into the original GR method. Specifically, ratings from users with high reputation are assigned higher weights in calculating the corresponding group sizes. Both the user reputation and the group sizes are iteratively updated until they become stable. This method is partially inspired by the GR method, the original resource-allocation process, and the HITS algorithm with its iterative refinement procedure. When tested on two real data sets (MovieLens and Netflix) with artificial spammers, the proposed IGR method performs very well in evaluating user reputation, and its robustness against a large number of spamming attacks is considerably improved compared with the original GR method. Further, we provide some insights into the mechanisms and analyze the characteristics of these methods. Results suggest that the IR method strongly prefers large-degree users, the CR and RR methods have no obvious degree preference, and the GR and IGR methods slightly prefer small-degree users. Our work provides a further understanding of some reputation evaluation methods and highlights the significance of considering users’ grouping behavior in designing better reputation systems.


2. Methods 

We first introduce some basic notation for the user reputation evaluation methods. The online rating system can be naturally described by a weighted bipartite network $G = \{U, O, E\}$, where $U = \{U_1, U_2, \ldots, U_m\}$, $O = \{O_1, O_2, \ldots, O_n\}$ and $E = \{E_1, E_2, \ldots, E_l\}$ are the sets of users, objects and ratings, respectively (see Fig. 1(a) for an illustration). Here, we use Greek and Latin letters, respectively, for object-related and user-related indices to distinguish them. The degrees of a user $i$ and an object $\alpha$ are denoted as $k_i$ and $k_\alpha$, respectively. Considering a discrete rating system, the bipartite network can be represented by a rating matrix $A$, where the element $A_{i\alpha} \in \Omega = \{\omega_1, \omega_2, \ldots, \omega_z\}$ is the weight of the link connecting user $i$ and object $\alpha$, with $A_{i\alpha}$ being equal to the corresponding rating value (see Fig. 1(b)). In a reputation system, each user $i$ is assigned a reputation value, which is denoted as $R_i$. In the following, we briefly introduce the proposed user reputation evaluation method.


2.1. Group-based ranking methods

The iterative group-based ranking (IGR) method and the original group-based ranking (GR) method are based on the same framework. Thus, we mainly introduce the IGR method. After the initial configuration in which each user $i$ has equal reputation, e.g., $R_i = 1$, the IGR method works as follows.

Firstly, for user $i$, the rating vector $A_i$ is mapped to a rating-object matrix $B^i$, whose element is defined as

$$B^i_{s\alpha} = \begin{cases} 1 & \text{if } A_{i\alpha} = \omega_s \\ \text{-} & \text{otherwise} \end{cases}, \tag{1}$$

where the symbol "-" stands for a non-value, which should be ignored in the calculation (the same below). In this way, users are grouped by their ratings, namely, users who give the same rating $\omega_s$ to object $\alpha$ belong to the group $\Gamma_{s\alpha}$. Mathematically, the group is defined as $\Gamma_{s\alpha} = \{U_i \mid B^i_{s\alpha} = 1\}$. Obviously, user $i$ belongs to $k_i$ different groups.

Secondly, based on the intuition that a user with poor reputation should have less chance of forming big groups, we calculate the size of group $\Gamma_{s\alpha}$ by considering both the rating-object matrices $B^i$ and users’ reputation $R_i$. Mathematically, the weighted group size $\Lambda_{s\alpha}$ is defined as

$$\Lambda_{s\alpha} = \sum_{i=1}^{m} R_i B^i_{s\alpha}, \tag{2}$$

where $m$ is the number of users. Then, a rating-rewarding matrix $\Lambda^*$ is established by normalizing matrix $\Lambda$ by column. Mathematically, $\Lambda^*_{s\alpha} = \Lambda_{s\alpha}/k_\alpha$.

Thirdly, referring to the rating-rewarding matrix $\Lambda^*$, the original rating matrix $A$ is mapped to a rewarding matrix $A'$. Specifically, the rewarding $A'_{i\alpha}$ that user $i$ obtains from the rating $A_{i\alpha}$ is defined as

$$A'_{i\alpha} = \begin{cases} \Lambda^*_{s\alpha} & \text{if } A_{i\alpha} = \omega_s \\ \text{-} & \text{otherwise} \end{cases}. \tag{3}$$

Finally, the reputation is re-allocated to all users according to their rewarding vectors. On the one hand, if the average of a user’s rewarding is small, most of his/her ratings deviate from the majority, indicating poor reputation. On the other hand, if the rewarding varies largely, he/she is also untrustworthy because of the unstable rating behavior. Based on these intuitions, the reputation $R_i$ of user $i$ is calculated as

$$R_i = \frac{\mu(A'_i)}{\sigma(A'_i)}, \tag{4}$$

where $\mu$ and $\sigma$ are the mean value and the standard deviation, respectively.

In IGR, the reputation $R$ and the group size $\Lambda$ are iteratively updated according to Eqs. (2), (3) and (4) until the change of the reputation $|R - R'|$ is smaller than the threshold value $\Delta = 10^{-4}$. Here, $R'$ denotes the reputation vector at the previous iteration step. Note that, when there is no iteration, IGR degenerates to the original GR. A visual representation of the IGR method is shown in Fig. 1.
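The steps above can be summarized in a short sketch. The following Python function is a minimal illustration, not the authors’ implementation: it stores ratings as a dictionary `{user: {object: rating}}`, uses the population standard deviation, assigns zero reputation to users whose rewards have zero variance, and measures the change of the reputation by the mean squared difference between successive iterations. These four choices, as well as the names `igr_reputation`, `tol`, `max_iter` and `iterate`, are assumptions made for the illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def igr_reputation(ratings, tol=1e-4, max_iter=100, iterate=True):
    """Sketch of group-based ranking; ratings is {user: {object: rating}}."""
    users = list(ratings)
    degree = defaultdict(int)  # k_alpha: number of ratings each object received
    for row in ratings.values():
        for obj in row:
            degree[obj] += 1

    rep = {u: 1.0 for u in users}  # initial configuration: equal reputation
    for _ in range(max_iter):
        # Reputation-weighted size of each group (rating value, object).
        size = defaultdict(float)
        for u, row in ratings.items():
            for obj, r in row.items():
                size[(r, obj)] += rep[u]

        new_rep = {}
        for u, row in ratings.items():
            # Reward of a rating: size of its group divided by the object degree.
            rewards = [size[(r, obj)] / degree[obj] for obj, r in row.items()]
            sd = pstdev(rewards)
            # Mean over standard deviation; the zero-variance fallback is an
            # assumption of this sketch.
            new_rep[u] = mean(rewards) / sd if sd > 0 else 0.0

        # Assumed stopping rule: mean squared change of the reputation.
        change = sum((new_rep[u] - rep[u]) ** 2 for u in users) / len(users)
        rep = new_rep
        if not iterate or change < tol:
            break
    return rep
```

Calling `igr_reputation(ratings, iterate=False)` stops after the first pass and therefore corresponds to the one-step GR method.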



Figure 1: Illustration of the IGR method. The number beside each arrow marks the order of the procedure. The symbol "-" in the matrices stands for a non-value, which should be ignored in the calculation. (a) The original weighted bipartite network, $G$. (b) The corresponding rating matrix, $A$. Rows and columns correspond to users and objects, respectively. (c) The rating-object matrix for user $i$, $B^i$, illustrated for $U_4$ (blue horizontal box in (b)). (d) The reputation-weighted group size matrix, $\Lambda$. Taking $O_2$ as an example (green vertical box in (b)), $\Lambda_{42} = R_2 \times B^2_{42} + R_4 \times B^4_{42} = 2$. (e) The rating-rewarding matrix, $\Lambda^*$, constructed by normalizing $\Lambda$ by column, e.g., $\Lambda^*_{42} = 2/(1 + 2 + 2) = 0.40$, where non-values are ignored. (f) The rewarding matrix, $A'$, obtained by mapping matrix $A$ with reference to $\Lambda^*$. (g) The reputation of users, $R$. $R'$ is the temporary reputation in the previous iteration step. In the IGR method, $\Lambda$ and $R'$ are iteratively updated according to (d), (e), (f) and (g), as indicated by the red arrows. Finally, a stable reputation $R$ is obtained.


2.2. Quality-based ranking methods

Quality-based ranking methods rest on the assumption that each object $\alpha$ is associated with a most objective rating that best reflects its true quality. As it is hard to determine the true quality of objects, the estimated quality $Q_\alpha$ of object $\alpha$ is usually used instead, which is defined as the object’s weighted average rating. Mathematically, it reads

$$Q_\alpha = \frac{\sum_{i \in U_\alpha} R_i A_{i\alpha}}{\sum_{i \in U_\alpha} R_i}, \tag{5}$$

where $U_\alpha$ is the set of users who have rated object $\alpha$, and $A_{i\alpha}$ is the rating given to object $\alpha$ by user $i$ with reputation $R_i$. Here, we consider three representative quality-based ranking methods, namely, iterative refinement (IR), correlation-based ranking (CR) and reputation redistribution ranking (RR).

The IR method calculates the user reputation and the object quality in an iterative way. Specifically, a user’s reputation is inversely proportional to the difference between the rating vector and the corresponding objects’ estimated quality vector. Mathematically, the difference is defined as

$$f_i = \frac{1}{k_i} \sum_{\alpha \in O_i} (A_{i\alpha} - Q_\alpha)^2, \tag{6}$$

where $Q_\alpha$ is the estimated quality value of object $\alpha$ and $O_i$ is the set of objects rated by user $i$. Initially, all users have the same reputation, e.g., $R_i = 1$. Then, the reputation of user $i$ is iteratively updated according to

$$R_i = (f_i + \varepsilon)^{-\beta}, \tag{7}$$

where $\beta$ is a tunable parameter, whose optimal value is around $\beta = 1$. The iteration proceeds according to Eqs. (5), (6) and (7) until both $Q_\alpha$ and $R_i$ converge.
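As an illustration of this iteration, the sketch below alternates between the weighted average rating and the reputation update. It is a minimal example under stated assumptions rather than a reference implementation: the data layout `{user: {object: rating}}`, the small constant `eps` that prevents division by zero, the stopping rule based on the mean squared change of the estimated quality, and all function and parameter names are choices made here for illustration. Every user is assumed to have at least one rating.

```python
from collections import defaultdict


def iterative_refinement(ratings, beta=1.0, eps=1e-8, tol=1e-4, max_iter=1000):
    """Return (reputation, quality) for ratings given as {user: {object: rating}}."""
    rep = {u: 1.0 for u in ratings}  # all users start with the same reputation
    quality = {}
    for _ in range(max_iter):
        # Estimated quality: reputation-weighted average rating of each object.
        num, den = defaultdict(float), defaultdict(float)
        for u, row in ratings.items():
            for obj, r in row.items():
                num[obj] += rep[u] * r
                den[obj] += rep[u]
        new_quality = {obj: num[obj] / den[obj] for obj in num}

        # Reputation: inverse power of the mean squared deviation from the quality.
        for u, row in ratings.items():
            diff = sum((r - new_quality[obj]) ** 2 for obj, r in row.items()) / len(row)
            rep[u] = (diff + eps) ** (-beta)

        # Assumed stopping rule: mean squared change of the estimated quality.
        if quality:
            change = sum(
                (new_quality[o] - quality[o]) ** 2 for o in new_quality
            ) / len(new_quality)
            if change < tol:
                quality = new_quality
                break
        quality = new_quality
    return rep, quality
```

On a toy data set in which two users agree and a third deviates strongly, the estimated quality moves toward the agreeing pair and the deviating user ends up with a much lower reputation.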

As the CR and RR methods are based on the same framework, only RR is introduced in the following. In RR, each user $i$ initially has reputation $R_i = k_i/n$, which can essentially be seen as the user’s activity. The estimated quality of objects is calculated by Eq. (5). To obtain the reputation $R_i$ of user $i$ in a step, a so-called temporal reputation $TR_i$ is calculated, which is the Pearson correlation coefficient between the rating vector $A_i$ and the estimated objects’ quality vector $Q_i$. Mathematically, $TR_i$ is defined as

$$TR_i = \frac{1}{k_i} \sum_{\alpha \in O_i} \left( \frac{A_{i\alpha} - \mu(A_i)}{\sigma(A_i)} \right) \left( \frac{Q_\alpha - \mu(Q_i)}{\sigma(Q_i)} \right), \tag{8}$$

where $\mu$ and $\sigma$ are functions of mean value and standard deviation, respectively. If $TR_i$ is smaller than 0, it is reset to 0, so that $TR_i$ lies in the range [0, 1]. Then, the reputation $R_i$ is obtained by nonlinearly redistributing $TR_i$ via

$$R_i = TR_i^{\theta} \, \frac{\sum_j TR_j}{\sum_j TR_j^{\theta}}, \tag{9}$$

where $\theta$ is a tunable parameter. Note that RR degenerates to CR when $\theta = 1$. In each step, both $Q_\alpha$ and $R_i$ are updated until the change of the estimated quality $|Q - Q'| = \frac{1}{n} \sum_{\alpha \in O} (Q_\alpha - Q'_\alpha)^2$ is smaller than a threshold value $\Delta = 10^{-4}$. Here, $Q'$ denotes the vector of objects’ qualities in the previous step, and the parameter $\theta$ is set as its optimal value $\theta = 3$.

Table 1: Basic characteristics of the real data sets. $m$ is the number of users, $n$ is the number of objects, $\langle k_U \rangle$ is the average degree of users, $\langle k_O \rangle$ is the average degree of objects, and $S = l/(mn)$ is the sparsity of the bipartite network, where $l$ is the number of all ratings.

Data set     $m$     $n$     $\langle k_U \rangle$   $\langle k_O \rangle$   $S$
MovieLens    943     1682    106                     60                      0.0630
Netflix      3000    2779    66                      71                      0.0237
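The nonlinear redistribution step of RR can be illustrated in isolation. The sketch below is a minimal example with hypothetical names (`redistribute`, `temp_rep`, `theta`); it clips negative temporal reputations to zero and then rescales their powers so that the total reputation equals the total temporal reputation. Returning all zeros when every temporal reputation is zero is an assumption made here to avoid a division by zero.

```python
def redistribute(temp_rep, theta=3.0):
    """Nonlinearly redistribute a list of temporal reputations."""
    clipped = [max(t, 0.0) for t in temp_rep]  # negative correlations are reset to 0
    norm = sum(t ** theta for t in clipped)
    if norm == 0:
        return [0.0] * len(clipped)  # assumed fallback, not specified in the text
    total = sum(clipped)
    # Each user keeps a share proportional to the theta-th power of the
    # temporal reputation, while the overall sum is preserved.
    return [t ** theta * total / norm for t in clipped]
```

With `theta=1.0` the clipped values are returned unchanged, which corresponds to the CR case; larger values of `theta` shift reputation from users with low temporal reputation to users with high temporal reputation.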


3. Data and metric 

3.1. Real rating data 

We consider two commonly used data sets in the study of online rating systems, namely, MovieLens and Netflix. Both data sets contain ratings of movies on a 5-point scale, with 1 being the worst and 5 being the best. The MovieLens data set is provided by the GroupLens project at the University of Minnesota (www.grouplens.org). Herein, we only use a small subset, which is sampled from the original data with the constraint that each user has at least 20 ratings and each movie is rated by at least one of these users. In the subset, 100000 ratings are given by 943 users to 1682 movies. Netflix is a huge data set released by the DVD rental company Netflix for its Netflix Prize contest (www.netflixprize.com). We extracted a small data set by randomly choosing 3000 users who have at least 20 ratings and taking all 2779 movies that are rated by at least one of these users. In total, there are 197248 ratings in the Netflix data set. Compared with Netflix, MovieLens has a larger average user degree, a smaller average object degree and a higher sparsity $S$. The basic statistics of the data sets are summarized in Table 1.


3.2. Artificial rating data 

To test the performance of different ranking methods, one way is to calculate the ranks of all users and compare them with the ground truth. However, in practice, the true ranks of users are not known in advance. As an alternative, we manipulate the real data set by adding artificial spammers and test to what extent these spammers can be detected by a ranking method. In fact, two types of distorted ratings, namely, malicious ratings and random ratings, are widely found in real online rating systems. The malicious ratings come from spammers who always give the minimum (maximum) allowable rating to push down (up) certain target objects. The random ratings mainly come from test engineers or mischievous users who give meaningless ratings randomly.

As real spammers are unknown, to generate artificial rating data sets, we add one type of artificial spammers (i.e., malicious or random) at a time to the original data. In the implementation, we randomly select $d$ users and turn them into spammers by replacing their original ratings with distorted ratings: (i) the integer 1 or 5 with equal probability (i.e., 0.5) for malicious spammers, and (ii) random integers in {1, 2, 3, 4, 5} for random spammers. Thus, the ratio of artificial spammers is $p = d/m$, where $m$ is the number of all users.
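The procedure for generating artificial spammers can be sketched as follows. This is a minimal illustration under assumptions: ratings are stored as `{user: {object: rating}}`, and the function name `add_spammers`, the `kind` argument and the optional `rng` argument (a `random.Random` instance for reproducibility) are hypothetical names introduced here.

```python
import random


def add_spammers(ratings, d, kind, rng=None):
    """Turn d randomly chosen users into spammers; return (new_ratings, spammers)."""
    if kind not in ("malicious", "random"):
        raise ValueError("kind must be 'malicious' or 'random'")
    rng = rng or random.Random()
    spammers = set(rng.sample(list(ratings), d))

    distorted = {}
    for user, row in ratings.items():
        if user not in spammers:
            distorted[user] = dict(row)  # normal users keep their ratings
        elif kind == "malicious":
            # Extreme ratings: 1 or 5 with equal probability.
            distorted[user] = {obj: rng.choice((1, 5)) for obj in row}
        else:
            # Random ratings: uniform integers in {1, 2, 3, 4, 5}.
            distorted[user] = {obj: rng.randint(1, 5) for obj in row}
    return distorted, spammers
```

Only the rating values of the selected users are replaced; the set of objects each user has rated, and hence every degree in the bipartite network, is left unchanged.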


3.3. Evaluation metric 

We apply two widely used metrics to evaluate the ranking performance, namely, recall and AUC (the area under the ROC curve). The recall focuses only on the top-$L$ ranks and measures to what extent the spammers are ranked at the top. Mathematically, the recall is defined as

$$R_c(L) = \frac{d'(L)}{d}, \tag{10}$$

where $d'(L) \leq d$ is the number of detected artificial spammers in the top-$L$ ranking list. In the following experiments, the length of the ranking list is set as $L = d$, at which setting recall is equivalent to another accuracy metric named precision. A larger value of $R_c$ indicates a higher accuracy of the ranking.
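A minimal sketch of the recall computation is given below. It assumes that the ranking list is ordered by ascending reputation, so that the users most likely to be spammers appear at the top; this ordering, the dictionary layout `{user: reputation}` and the name `recall_at_L` are assumptions made for the illustration.

```python
def recall_at_L(reputation, spammers, L=None):
    """Fraction of known spammers found among the L lowest-reputation users."""
    if L is None:
        L = len(spammers)  # the setting L = d used in the experiments
    ranked = sorted(reputation, key=reputation.get)  # lowest reputation first
    detected = sum(1 for user in ranked[:L] if user in spammers)
    return detected / len(spammers)
```

For example, with reputations `{"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.2}` and spammers `{"b", "c"}`, the two lowest-reputation users are `b` and `d`, so one of the two spammers is detected and the recall is 0.5.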

Next, we introduce the $L$-independent metric AUC. Given the ranks of all users, the AUC value can essentially be seen as the probability that the reputation of a randomly chosen spammer is lower than that of a randomly chosen normal user (non-spammer). To calculate AUC, each time a pair consisting of a spammer and a normal user is picked and their reputations are compared. If among $N$ independent comparisons, there are $N'$ times the spammer has a lower reputation and $N''$ times they have the same reputation, the AUC value is defined as

$$\mathrm{AUC} = \frac{N' + 0.5 N''}{N}. \tag{11}$$

The value of AUC should be about 0.5 if all normal users and spammers are ranked randomly. Therefore, the more the AUC value exceeds 0.5, the better the ranking method performs.
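The AUC definition can be illustrated with a short sketch. Note that, instead of drawing $N$ independent random pairs, the function below enumerates every spammer and normal-user pair exactly once; this exhaustive variant is a substitution made here for simplicity and determinism. The name `auc` and the dictionary layout `{user: reputation}` are likewise assumptions.

```python
def auc(reputation, spammers):
    """AUC over all (spammer, normal user) pairs; ties count as one half."""
    spam = [r for user, r in reputation.items() if user in spammers]
    normal = [r for user, r in reputation.items() if user not in spammers]
    lower = sum(1 for s in spam for x in normal if s < x)  # spammer ranked lower
    ties = sum(1 for s in spam for x in normal if s == x)  # equal reputation
    return (lower + 0.5 * ties) / (len(spam) * len(normal))
```

The quadratic number of comparisons is acceptable for small examples; for large data sets, sampling pairs as described in the text is the cheaper option.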


3.4. Self-consistency metric 

For reputation evaluation methods, there is an intuition that a user with a higher rating error should have a lower reputation, and vice versa. That is to say, for a well-performing method, the reputation should be negatively correlated with the rating error. Here, the rating error of a user refers to the degree of deviation between the rating vector $A_i$ and the estimated objects’ quality. Mathematically, for user $i$, the rating error $\delta_i$ is defined as

$$\delta_i = \frac{\sum_{\alpha \in O_i} |A_{i\alpha} - Q_\alpha|}{k_i}, \tag{12}$$

where $O_i$ is the set of objects rated by user $i$, and $Q_\alpha = \sum_{i \in U_\alpha} A_{i\alpha}/k_\alpha$ is the average rating that object $\alpha$ receives. In fact, the correlation between $\delta_i$ and $R_i$ measures the self-consistency of a ranking method, as $\delta_i$ depends on $Q$ and $Q$ depends on $R_i$ in turn. The stronger the negative correlation is, the more self-consistent the method is.
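The self-consistency check can be sketched in two small functions: one for the rating error of each user and one for the Pearson correlation coefficient. The function names and the data layout `{user: {object: rating}}` are assumptions made for this illustration, and `pearson` assumes that neither input sequence is constant.

```python
from collections import defaultdict
from math import sqrt


def rating_errors(ratings):
    """Mean absolute deviation of each user's ratings from the object averages."""
    total, count = defaultdict(float), defaultdict(int)
    for row in ratings.values():
        for obj, r in row.items():
            total[obj] += r
            count[obj] += 1
    average = {obj: total[obj] / count[obj] for obj in total}
    return {
        user: sum(abs(r - average[obj]) for obj, r in row.items()) / len(row)
        for user, row in ratings.items()
    }


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Given a reputation dictionary `rep` from any of the ranking methods, the self-consistency would be measured as `pearson([errors[u] for u in users], [rep[u] for u in users])`, where `errors = rating_errors(ratings)` and `users` is a fixed ordering of the users.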


4. Results 

4.1. Reputation evaluation 

First, we consider the probability distribution of users’ reputation after applying the reputation evaluation methods to the real online rating data sets. Results are shown in Fig. 2. It can be seen that in IR the reputation is Poisson-like distributed, whereas in CR, GR and IGR the reputation is normal-like distributed. By contrast, in RR the reputation is exponential-like distributed, which is remarkably different, as the reputation of most users is zero (see Figs. 2(c) and 2(h)). To quantify the diversity of all users’ reputation from the probability distribution, we calculate Simpson’s index of diversity, which is denoted as $1 - D$. A higher value of $1 - D$ suggests that the obtained reputation is more distinguishable. In CR, the values of $1 - D$ are the highest, 0.9343 and 0.9318 for MovieLens and Netflix, respectively. In GR and IGR, the values of $1 - D$ are nearly the same, around 0.90 and 0.88 for MovieLens and Netflix, respectively. In RR, the values of $1 - D$ are the lowest, suggesting that the reputation of users in RR is the least distinguishable. The reputation assigned by a well-performing reputation evaluation method should be distinguishable, and in this respect CR, GR and IGR perform better.

Figure 2: The probability distribution of users’ reputation after applying different reputation evaluation methods to the two real online rating data sets, MovieLens and Netflix. Subfigures (a), (b), (c), (d) and (e) are for MovieLens; subfigures (f), (g), (h), (i) and (j) are for Netflix. $R$ is the reputation of users. $1 - D$ is Simpson’s index of diversity.

Then, in Figs. 3(a) and 3(d), we show the relation between $\delta$ and $R$, i.e., the self-consistency, for different methods. We note that GR and IGR both assign a high reputation to users with low rating errors and a stably low reputation to users with high rating errors. By contrast, the three quality-based ranking methods, i.e., IR, CR and RR, are not stable in dealing with users with high rating errors, as indicated by the high variation of $R$ when $\delta$ is large. To quantify the relation, we additionally calculate the Pearson correlation coefficient $\rho$ between $R$ and $\delta$. Results are shown in the first row of Table 2. The values of $\rho$ are, respectively, -0.8166 and -0.8201 (-0.7353 and -0.7629) for GR and IGR in the MovieLens (Netflix) data set. These strongest negative correlations suggest that GR and IGR are the most self-consistent in user reputation evaluation.


We next consider the effect of the user degree $k_u$ on the corresponding reputation $R$ under different ranking methods. Figs. 3(b) and 3(e) show the relations between $k_u$ and $R$. It is worth noting that $R$ in IR is positively correlated with $k_u$, as the correlation is 0.8759 and 0.7868 for MovieLens and Netflix, respectively. In fact, the degree $k_u$ can essentially be seen as a user’s activity. Thus, the result indicates that IR prefers users with high activity, as it gives a higher reputation to active users than to inactive ones. By contrast, for the other four methods, there is no obvious degree preference, as the correlations are much weaker, ranging from -0.0950 to 0.2318 (see the second row of Table 2). The main reason for these observations is that $R$ in IR is inversely proportional to the mean square of the difference between $A_{i\alpha}$ and $Q_\alpha$. As this difference is degree-dependent, large-degree users get a higher reputation during the iteration in IR. By contrast, CR and RR calculate a correlation, and GR and IGR calculate a mean and a standard deviation, all of which are independent of the user degree. In practice, there is another interpretation of this positive correlation for IR. The user degree can be roughly seen as a reflection of buyers’ experience. Users with larger degree receive more information and are more experienced. Hence, it can be roughly considered that large-degree users have better judgement and that their reputation should be higher. However, such a straightforward index is not sufficient, as it makes it hard to identify large-degree spammers.

Further, we study how the degree of trend following affects the reputation evaluation. The so-called degree of trend following measures to what extent a user tends to collect objects of high popularity. Usually, the popularity of an object is represented by its degree. Hence, a user’s degree of trend following, denoted as $\phi$, can be calculated as the average degree of the objects rated by the user. Mathematically, it reads

$$\phi_i = \frac{1}{k_i} \sum_{\alpha \in O_i} k_\alpha, \tag{13}$$

where $O_i$ is the set of objects rated by user $i$, $k_i$ is the degree of user $i$, and $k_\alpha$ is the degree of object $\alpha$. The relations between the user reputation $R$ and the degree of trend following $\phi$ are shown in Figs. 3(c) and 3(f). It can be seen that $R$ in IR is negatively correlated with $\phi$, as the values of $\rho$ are -0.4746 and -0.3793 for MovieLens and Netflix, respectively (see the third row of Table 2). In GR and IGR, $R$ is weakly positively correlated with $\phi$, as the value of $\rho$ is around 0.2. In CR and RR, the value of $\rho$ is around 0, indicating that $R$ is almost independent of $\phi$. To better understand these observations, we focus on the mechanisms of these methods. In IR, the ratings from a user with larger $\phi$ have less chance of dominating the corresponding object’s quality, which finally results in the user’s lower reputation. In GR and IGR, a larger $\phi$ ensures a more stable grouping, which results in a user’s higher reputation. For a more intuitive understanding, we consider the practical meaning of the differences among the correlation coefficients. Users who always buy highly popular items have mainstream taste, and the information they receive is widely shared. Thus, in IR, it is much harder for them to get a high reputation than for users who have unique taste and richer information. By contrast, users with a larger degree of trend following have better grouping behavior in collecting objects, and they obtain a higher reputation in GR and IGR.

Figure 3: The relations between $R$ and $\delta$, $k_u$ and $\phi$, respectively. Subfigures (a), (b) and (c) are for MovieLens; subfigures (d), (e) and (f) are for Netflix. $\delta$ is the rating error of users, $k_u$ is the degree of users, and $\phi$ is the degree of trend following. For comparison, $\delta$, $k_u$ and $\phi$ are each normalized. As the three normalized indicators are continuous, we divide each of them into bins of length 0.05 and then evaluate the mean reputation of users in the same bin.

Table 2: Pearson correlation coefficient $\rho$ between the reputation $R$ and the rating error $\delta$, the degree of users $k_u$ and the degree of trend following $\phi$, respectively. The highest correlation coefficients in each row are emphasized in bold.

                    MovieLens                                        Netflix
Metrics             IR       CR       RR       GR       IGR         IR       CR       RR       GR       IGR
$\rho(\delta, R)$   -0.4471  -0.4537  -0.3189  -0.8166  -0.8201     -0.4640  -0.3926  -0.2812  -0.7353  -0.7629
$\rho(k_u, R)$      0.8759   0.2318   0.1719   -0.0519  -0.0419     0.7868   0.0538   0.0040   -0.0950  -0.0904
$\rho(\phi, R)$     -0.4746  -0.0244  -0.0287  0.2141   0.2048      -0.3793  -0.0428  -0.0569  0.2368   0.2157
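The degree of trend following, i.e., the average degree of the objects a user has rated, is straightforward to compute. The sketch below is a minimal illustration; the function name `trend_following` and the data layout `{user: {object: rating}}` are assumptions.

```python
from collections import Counter


def trend_following(ratings):
    """Average degree (popularity) of the objects rated by each user."""
    # Object degree: number of users who rated the object.
    object_degree = Counter(obj for row in ratings.values() for obj in row)
    return {
        user: sum(object_degree[obj] for obj in row) / len(row)
        for user, row in ratings.items()
    }
```

For instance, if user `a` rated objects `x` and `y` while user `b` rated only `x`, then `x` has degree 2 and `y` has degree 1, giving a degree of trend following of 1.5 for `a` and 2.0 for `b`.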

4.2. Random spamming analysis 

To evaluate the performance of different methods in resisting random spamming, we first generate artificial data sets with random spammers and then calculate $R_c$ and AUC accordingly. Results are shown in Fig. 4. When focusing on the top ranks, indicated by the value of $R_c$ in Figs. 4(a) and 4(b), GR and IGR both have the best performance, and IGR is more robust than GR. CR is on a par with RR, and they both outperform IR. Further, we note that the value of $R_c$ increases as $p$ increases. Specifically, the value of $R_c$ grows rapidly as $p$ approaches a value around 0.05. Afterwards, the value of $R_c$ becomes stable. The result suggests that there are some real random spammers in the original rating data sets, and that their ratio is about 0.05. When focusing on the overall performance, indicated by the values of AUC in Figs. 4(c) and 4(d), GR and IGR remarkably outperform the other methods by giving a robust AUC value around 0.96. CR and RR are slightly inferior, as their AUC value is about 0.92. For IR, the AUC value is significantly lower, indicating its limited performance. In short, group-based methods outperform the quality-based methods in resisting random spamming.

Figure 4: Performance of different methods on data sets with random spamming. Subfigures (a) and (b) are for $R_c$; subfigures (c) and (d) are for AUC. The parameter $p$ is the ratio of random spammers. Results are averaged over 100 independent realizations.

For a more intuitive understanding of how different methods work in resisting random spamming, in Fig. 5 we show the effect of the user degree on reputation evaluation in the parameter space $(R, k_u)$. It can be seen that $R$ is positively correlated with $k_u$ in IR. Hence, for users with similar degree, IR can accurately distinguish spammers from normal users, as shown in Figs. 5(a) and 5(f). Despite this, IR gives a low reputation to many users (see Figs. 2(a) and 2(f)) but a relatively high reputation to spammers with large degree, which results in its poor performance. Meanwhile, CR gives all users (especially some small-degree spammers) a relatively high reputation, as indicated by most of the dots being in the middle and top of Figs. 5(b) and 5(g). In other words, the mean of all users’ reputation in CR is relatively high (see Figs. 2(b) and 2(g)). By contrast, RR overly limits all users’ reputation, as indicated by most dots being at the bottom of Figs. 5(c) and 5(h), although it gives most spammers a low reputation. In RR, a lot of users have zero reputation (see Figs. 2(c) and 2(h)), which results in a high false positive rate in spam detection. GR and IGR both slightly prefer small-degree users, as they give a lower $R$ to larger-degree users (see Figs. 5(d) and 5(i) for GR and Figs. 5(e) and 5(j) for IGR). In GR and IGR, the reputation is normal-like distributed and the spammers are always assigned a low $R$. These characteristics give GR and IGR the best performance in evaluating user reputation.

To quantify the effects of the user degree on ranking, we divide all users into three subgroups, namely, Low, Mid and High, according to their degrees. As evidenced by the heavy-tailed (i.e., stretched exponential) distribution of the user degree, only a small number of users have large degree. To balance the number of users in each subgroup, the intervals of the user degree $k_u$ for the groups Low, Mid and High are set as $[k_{min}, k_{min} + 0.1(k_{max} - k_{min}))$, $[k_{min} + 0.1(k_{max} - k_{min}), k_{min} + 0.3(k_{max} - k_{min}))$ and $[k_{min} + 0.3(k_{max} - k_{min}), k_{max}]$, respectively, where $k_{min}$ and $k_{max}$ are the minimum and maximum values of $k_u$. In each subgroup, AUC is calculated after applying the five methods. Accordingly, the relative ranks of these methods are obtained. Results are shown in Figs. 6(a) and 6(b) for MovieLens and Netflix, respectively. It can be seen that IR has limited performance for Low and Mid degree spammers. CR and RR perform well for High degree spammers but poorly for Low degree spammers. By contrast, GR and IGR outperform the other methods for Low degree spammers. In ranking All spammers, the order of these methods from the worst to the best is IR, RR, CR, GR and IGR.
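The assignment of users to the Low, Mid and High subgroups can be sketched as below. The cut points at 10% and 30% of the degree range follow the intervals described above; the function name `degree_subgroup` and its argument names are hypothetical.

```python
def degree_subgroup(k, k_min, k_max):
    """Assign a user degree k to the Low, Mid or High subgroup."""
    span = k_max - k_min
    if k < k_min + 0.1 * span:
        return "Low"
    if k < k_min + 0.3 * span:
        return "Mid"
    return "High"
```

Because the degree distribution is heavy-tailed, the narrow Low interval still contains many users, while the wide High interval contains only a few.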

Figure 5: The relation between R and $k_u$. R is the reputation of users, obtained by applying different methods on data sets with random spamming, and $k_u$ is the
degree of users. The data points colored gray and pink stand for normal users and random spammers, respectively. The parameter is set as p = 0.1. Results in each
subfigure are for one realization.

Figure 6: Comparison of different methods in ranking random spammers with different degree $k_u$. Subfigures (a) and (b) are for MovieLens and Netflix,
respectively. According to $k_u$, all users are divided into three subgroups, namely Low, Mid and High. In each subgroup, AUC is calculated after applying different
ranking methods. Accordingly, the relative ranks of these methods are obtained. The parameter is set as p = 0.1. Results are averaged over 100 independent
realizations.

4.3. Malicious spamming analysis

To evaluate the performance of different methods in resisting
malicious spamming, we first generate artificial data sets
with malicious spammers and then calculate $R_c$ and AUC
accordingly. Results are shown in Fig. 7. When focusing on $R_c$,
GR and IGR both have the best performance when the ratio of
spammers p is small. It is worth noting that IGR is much
more robust than GR, since the values of $R_c$ in GR decrease
faster than those in IGR as p increases (see Figs. 7(a) and 7(b)). CR
and RR have similar performance, and the $R_c$ values in the two
methods increase as p increases. The performance of IR depends
on the data sets, and overall it outperforms CR and RR.
Moreover, we note that when p is small, the values of $R_c$ in GR
and IGR are all around 0.8, while the values in CR and RR are
almost 0. These results suggest that there are some real malicious
spammers in the original data sets, and that GR and IGR are
much better in resisting malicious spamming. Considering the
overall performance indicated by AUC in Figs. 7(c) and 7(d), IGR
has the best performance, as its values of AUC are over 0.95.
GR is less robust than IGR, especially when p is large.
CR and RR are robust against a large number of spammers, as
their AUC values stabilize at about 0.92. Moreover, the
performance of IR depends on the data sets. To conclude, in
resisting malicious spamming, the group-based methods outperform
the quality-based methods.
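
As a rough illustration of how a recall-type score such as $R_c$ could be computed from a reputation vector, the sketch below assumes that $R_c$ is the fraction of known spammers found among the L users with the lowest reputation, with L defaulting to the number of spammers. That definition is an assumption here, since this section does not restate it, and `recall_lowest_L` and the toy data are hypothetical.

```python
def recall_lowest_L(reputation, spammers, L=None):
    """Fraction of known spammers among the L lowest-reputation users."""
    spammers = set(spammers)
    if L is None:
        L = len(spammers)  # assumed default: list length equals number of spammers
    # Rank users from lowest to highest reputation; ties are broken by user id
    # only to keep the result deterministic.
    ranked = sorted(reputation, key=lambda u: (reputation[u], str(u)))
    flagged = set(ranked[:L])
    return len(flagged & spammers) / len(spammers)


# Toy data (hypothetical): five users, two of them labeled as spammers.
reputation = {"u1": 0.9, "u2": 0.1, "u3": 0.2, "u4": 0.8, "u5": 0.05}
print(recall_lowest_L(reputation, spammers={"u2", "u4"}))
```

Evaluating such a function on data sets generated with increasing p, and averaging over independent realizations, would produce curves of the kind shown in Figs. 7(a) and 7(b).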

Figure 7: Performance of different methods on data sets with malicious spamming. Subfigures (a) and (b) are for $R_c$, subfigures (c) and (d) are for AUC. p is the
ratio of malicious spammers. Results are averaged over 100 independent realizations.

To better understand how these methods work in resisting
malicious spamming, in Fig. 8 we show the effect of the user
degree $k_u$ on evaluating user reputation in the parameter space
(R, $k_u$). IR gives a high R to large-degree spammers due to its
preference for users of large $k_u$ (see Fig. 8). CR has
no obvious degree preference, as it gives a high R to some users
regardless of their $k_u$ (see Fig. 8). RR excessively limits
the R of all users by giving an almost zero reputation to many users
(see Fig. 8), which increases the false positive rate in
spam detection. In GR and IGR, the reputation is approximately
normally distributed and spammers are always assigned a
low R (see the GR and IGR panels of Fig. 8).

To quantify the effects of the user degree $k_u$ on ranking,
we show the relative ranks of different methods by AUC after
dividing all users into three subgroups according to $k_u$ in
Figs. 9(a) and 9(b). It can be seen that IR has better performance
for Mid and High degree spammers. CR and RR perform better
for High degree spammers. GR and IGR outperform the other
methods for Low degree spammers, although they are not
competitive for High degree spammers. Nevertheless, in ranking
All spammers, IGR again has the best performance.

5. Conclusions and discussion 

In summary, we have proposed an iterative group-based ranking
method for user reputation evaluation by introducing an
iterative reputation-allocation process into the original group-based
ranking method. Specifically, when calculating the corresponding
group sizes, ratings are assigned higher weights if they
come from users with high reputation; otherwise they are
assigned lower weights. The user reputation
and the corresponding group sizes are iteratively calculated
until they become stable. Extensive experiments on two real data
sets suggest that the proposed method remarkably outperforms
the previous quality-based methods. Further, we provided some
insights into the mechanism and analyzed the characteristics of
these methods. Results suggest that the iterative refinement
method remarkably prefers large-degree users, the correlation-based
method and the reputation redistribution method have no
obvious degree preference, and the group-based methods slightly
prefer small-degree users.
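
The iteration summarized above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the exact formulation of the proposed method: reputations start uniform, the weighted size of a group is the summed reputation of the users who gave the same rating to the same object, and a user's new reputation is taken as the mean weighted share of the groups that user belongs to. The function name `iterative_group_ranking`, the stopping tolerance and the toy data are all hypothetical.

```python
from collections import defaultdict


def iterative_group_ranking(ratings, tol=1e-9, max_iter=1000):
    """ratings: dict mapping user -> dict mapping object -> discrete rating.
    Returns a dict mapping user -> reputation (simplified sketch)."""
    reputation = {u: 1.0 for u in ratings}  # assumption: uniform starting reputation
    for _ in range(max_iter):
        # Reputation-weighted size of every (object, rating value) group.
        group = defaultdict(float)
        total = defaultdict(float)
        for u, items in ratings.items():
            for obj, r in items.items():
                group[(obj, r)] += reputation[u]
                total[obj] += reputation[u]
        # Assumed update: mean weighted share of the groups the user belongs to.
        updated = {}
        for u, items in ratings.items():
            shares = [group[(obj, r)] / total[obj] if total[obj] > 0 else 0.0
                      for obj, r in items.items()]
            updated[u] = sum(shares) / len(shares) if shares else 0.0
        change = max((abs(updated[u] - reputation[u]) for u in ratings), default=0.0)
        reputation = updated
        if change < tol:  # stop once reputations are stable
            break
    return reputation


# Toy data (hypothetical): users a, b, c mostly agree; z rates against them.
ratings = {
    "a": {"o1": 5, "o2": 1, "o3": 4, "o4": 2},
    "b": {"o1": 5, "o2": 1, "o3": 4, "o4": 2},
    "c": {"o1": 5, "o2": 1, "o3": 4, "o4": 3},
    "z": {"o1": 1, "o2": 5, "o3": 1, "o4": 5},
}
for user, score in sorted(iterative_group_ranking(ratings).items()):
    print(user, round(score, 4))
```

In this toy example the ratings of the dissenting user carry less and less weight in each round, which mirrors the qualitative behavior described above: low-reputation users have little chance of forming big groups.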

Figure 8: The relation between R and $k_u$. R is the reputation of users, obtained by applying different methods on data sets with malicious spamming, and $k_u$ is the
degree of users. The data points colored gray and pink stand for normal users and malicious spammers, respectively. The parameter is set as p = 0.1. Results in
each subfigure are for one realization.

Figure 9: Comparison of different methods in ranking malicious spammers with different degree $k_u$. Subfigures (a) and (b) are for MovieLens and Netflix,
respectively. According to $k_u$, all users are divided into three subgroups, namely Low, Mid and High. In each subgroup, AUC is calculated after applying different
ranking methods. Accordingly, the relative ranks of these methods are obtained. The parameter is set as p = 0.1. Results are averaged over 100 independent
realizations.

From the macro analysis, the group-based ranking methods
are distinguishable from the quality-based methods, as the
former assign users’ reputation by considering their grouping
behaviors while the latter are based on the estimation of
objects’ true qualities. The stability of assigning low reputation
to users with high rating error and the independence of the
reputation from the user degree ensure the effectiveness of the
group-based ranking methods. In fact, the proposed method is an
improvement of the original group-based ranking method,
inspired by the original resource-allocation process [12] and
the iterative refinement method. In particular, compared
to the original one, the proposed method is more robust in
resisting a large number of spamming attacks. That is mainly
because in the proposed method the ratings from users with poor
reputation have less chance of forming big groups and the
reputation is iteratively updated. Even though the number of
spammers increases, the effect of spam ratings on the whole system
is restricted and the reputation of spammers decays through the
iterations.

Our work provides a further understanding of the mechanism
of some user reputation evaluation methods and gives
some insights into the significance of considering users’ grouping
behaviors in enhancing the algorithmic performance. The
proposed method is not only better in accuracy and robustness,
but also easier to implement. Traditionally, a well-performing
method should converge to a unique reputation vector;
however, most of the previous reputation-based ranking methods
cannot guarantee convergence. Although extensive
simulations suggest that the proposed method converges,
we still expect further theoretical analysis to justify it.
Moreover, the previous studies either assume a continuum of rating
values, as in the correlation-based method, or rest on the
assumption of a discrete rating system, as in the group-based
method. In other words, how continuous vs. discrete-valued
ratings affect the user reputation evaluation is still an open issue
and worthy of further consideration. As future work, we
could consider applying the proposed method to rating systems
with higher-resolution scales and designing more reputation
evaluation methods that can make best use of users’ group-